Overview of the Data
Defining our Dataset
Transforming the Data
Reducing Noise with Feature Selection
Class Balance
Benchmark Model
Scoring with the ROC Curve
Class Imbalance with PySpark
Model Performance Metrics
Benchmark Model
Model Performance with the Classification Report
Scoring with ROC_AUC
Visualizing the Decision Tree
Model Performance Metrics
Benchmark Model
Model Performance with the Classification Report
data_overview = pd.read_csv("../Celerative_Working/Data_Cleaning/data_overview.csv")
data_overview.drop(["Unnamed: 0"], axis=1, inplace=True)
data_overview.head(5)
| | name | profile | skills |
|---|---|---|---|
| 0 | Hugo L. Samayoa | DevOps Developer | ['Agile Software Development', 'Kubernetes', '... |
| 1 | Stepan Yakovenko | Software Developer | ['Debian Linux', 'MySQL', 'HTML5', 'GWT', 'Aja... |
| 2 | Slobodan Gajic | Software Developer | ['Firebase', 'JSON', 'CSS', 'Angular', 'jQuery... |
| 3 | Bruno Furtado Montes Oliveira | Visual Studio Team Services (VSTS) Developer | ['Agile', 'Debian Linux', 'PostgreSQL', 'Pytho... |
| 4 | Jennifer Aquino | Query Optimization Developer | ['Automation', 'SQL Server Integration Service... |
| Observations or Samples | Classes, Predicted vs. Actual (y) | Features (X) |
|---|---|---|
| observation #1 | profile_A | skill_1, skill_5, skill_3... |
| observation #2 | profile_B | skill_5, skill_2, skill_4... |
| observation #3 | profile_C | skill_4, skill_2, skill_1... |
| observation #4 | profile_B | skill_4, skill_1, skill_5... |
| observation #5 | profile_A | skill_3, skill_4, skill_5... |
| Similar strings | | |
|---|---|---|
| 'apache' | 'apache spark' | |
| 'apache ant' | 'apache tomcat' | |
| 'rest' | 'rest apis' | |
| 'android studio' | 'android' | 'android sdk' |
| 'asp.net mvc' | 'asp.net' | 'asp' |
| 'visual studio code' | 'microsoft visual studio' | |
| 'angular js' | 'angular' | |
| 'c#' | 'c' | 'c++' |
| 'google app engine' | 'google cloud' | 'google compute engine' |
| 'google analytics' | 'google adsense' | 'google adwords' |
| 'phpstorm' | 'php' | |
| 'net' | 'net framework' | |
| 'aws ec2' | 'aws s3' | |
print("Before the data was cleaned and processed, the dataset consisted of:\n")
print(
"\t - ",
data_overview.profile.nunique(),
'unique profile types or "classes" \n',
)
print("\t & ")
print(
"\n\t - ",
data_overview.skills.explode().nunique(),
'skills or "features"',
)
Before the data was cleaned and processed, the dataset consisted of: - 2672 unique profile types or "classes" & - 5496 skills or "features"
print(
"Using Natural Language Processing and Text Pre-Processing techniques provided via the Scikit-Learn library, skills & profile strings were standardized to:"
)
print("\t - ", clean_talentpool.profile.nunique(), "profiles (classes)")
print("\t & ")
print("\t - ", clean_talentpool.skills.explode().nunique(), "skills (features)")
Using Natural Language Processing and Text Pre-Processing techniques provided via the Scikit-Learn library, skills & profile strings were standardized to: - 415 profiles (classes) & - 780 skills (features)
# Generating wordcloud
wordcloud = WordCloud(
width=3000,
height=2000,
random_state=1,
max_words=250,
background_color="black",
colormap="Blues",
collocations=False,
stopwords=STOPWORDS,
).generate(
series_to_str(clean_talentpool["skills"])
) # Plot
plot_cloud(wordcloud)
pd.DataFrame(clean_talentpool["profile"].value_counts())[1:19]
| profile | |
|---|---|
| quality assurance | 79 |
| front-end developer | 69 |
| javascript developer | 64 |
| ios developer | 55 |
| full-stack developer | 54 |
| frontend php javascript | 51 |
| machine learning engineer | 50 |
| python developer | 48 |
| php developer | 47 |
| java developer | 41 |
| android developer | 31 |
| javascript frontend design | 28 |
| c# developer | 24 |
| ruby on rails developer | 22 |
| architecture developer | 22 |
| software architect | 21 |
| design | 20 |
| windows presentation foundation (wpf) developer | 19 |
It's clear that a few of the top profile types are described only by their key skills, such as "frontend php javascript", "javascript frontend design", etc. These top keywords may reveal more about the data's profiles when isolated from non-descriptive words like "developer" and "engineer".
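To make the idea concrete, here is a minimal sketch of how such keywords could be isolated, using a hypothetical list of profile strings and an assumed stop-list of role words (neither is from the original notebook):

```python
from collections import Counter

# Hypothetical profile strings standing in for the real "profile" column
profiles = [
    "frontend php javascript",
    "javascript frontend design",
    "python developer",
    "machine learning engineer",
]

# Role words assumed to be non-descriptive (an assumption for this sketch)
role_words = {"developer", "engineer", "senior", "middle"}

# Count every remaining word across all profile strings
keyword_counts = Counter(
    word
    for profile in profiles
    for word in profile.split()
    if word not in role_words
)
print(keyword_counts.most_common(3))
```

Counting words this way surfaces skills like "javascript" and "frontend" while suppressing generic role words, which is the pattern visible in the keyword table below.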
# Drop empty row & subsetting for top 25 keywords
dotplot.drop(14, inplace=True)
dotplot.reset_index(drop=True)[:25]
| profile_keywords | count | |
|---|---|---|
| 0 | javascript | 411 |
| 1 | frontend | 339 |
| 2 | php | 164 |
| 3 | design | 157 |
| 4 | fullstack | 150 |
| 5 | ios | 119 |
| 6 | senior | 118 |
| 7 | python | 102 |
| 8 | java | 97 |
| 9 | ruby | 91 |
| 10 | android | 90 |
| 11 | web | 87 |
| 12 | qualityassurance | 79 |
| 13 | data | 77 |
| 14 | net | 73 |
| 15 | machinelearning | 60 |
| 16 | devops | 56 |
| 17 | architecture | 53 |
| 18 | c | 43 |
| 19 | rails | 37 |
| 20 | middle | 36 |
| 21 | backend | 30 |
| 22 | testing | 30 |
| 23 | wordpress | 30 |
| 24 | system | 27 |
# Draw plot
y = dotplot.profile_keywords[:45]
fig, ax = plt.subplots(figsize=(10, 12), dpi=80)
ax.hlines(
y=y,
xmin=10,
xmax=205,
color="gray",
alpha=0.7,
linewidth=1,
linestyles="dashdot",
)
ax.scatter(
    y=y,
    x=dotplot["count"][:45],  # match the 45-row slice used for y
    s=75,
    color="firebrick",
    alpha=0.7,
)
# Title, Label, Ticks and Ylim
ax.set_title("Most Common Profile Keywords", fontdict={"size": "22"})
ax.set_xlabel("Count of Keywords", fontdict={"size": "22"})
ax.set_yticklabels(y, fontdict={"horizontalalignment": "right", "size": "14"})
ax.set_xlim(5, 200)
plt.show()
# Get percentage of developer profiles
(five_profiles.profile.value_counts(normalize=True) * 100).astype(int)
Frontend 48 Fullstack 21 QA 11 Database 9 ML 8 Name: profile, dtype: int32
print(
"Although profile types have been cut to 5 broad profile categories, \
\nthere are still",
ex_skills.skills.nunique(),
"skills (vs. the 812 seen with all profiles). \
\nSo although many of the classes have been cut, most skills remain,\
\nqualifying the exploration of this particular subset of the data as valid.",
)
Although profile types have been cut to 5 broad profile categories, there are still 527 skills (vs. the 812 seen with all profiles). So although many of the classes have been cut, most skills remain, qualifying the exploration of this particular subset of the data as valid.
| Observations/Samples | Classes (y) | Features (X) |
|---|---|---|
| observation #1 | profile_A | skill_1, skill_5, skill_3... |
| observation #2 | profile_B | skill_5, skill_2, skill_4... |
| observation #3 | profile_C | skill_4, skill_2, skill_1... |
| observation #4 | profile_B | skill_4, skill_1, skill_5... |
| observation #5 | profile_A | skill_3, skill_4, skill_5... |
| Observations/Samples | Classes (y) | Feature1 (X) | Feature2 (X) | Feature3 (X) | Feature4 (X)... |
|---|---|---|---|---|---|
| observation #1 | profile_A | skill_1 | skill_2 | skill_3 | skill_4 ... |
| observation #2 | profile_B | skill_1 | skill_2 | skill_3 | skill_4 ... |
| observation #3 | profile_C | skill_1 | skill_2 | skill_3 | skill_4 ... |
| observation #4 | profile_B | skill_1 | skill_2 | skill_3 | skill_4 ... |
| observation #5 | profile_A | skill_1 | skill_2 | skill_3 | skill_4 ... |
sklearn's MultiLabelBinarizer() object is used to take each skill and expand it, so that each skill (feature) has its own column. If a sample, or developer, has the specified skill, the corresponding row is marked with a 1. Otherwise, it's marked with a 0.
This effectively converts all skills to binary variables, so that the features (skills) run along the columns while the observations run along the rows. Machine learning algorithms need data formatted this way in order to process it correctly.
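A minimal sketch of that transformation, on a toy skills column rather than the real dataset:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy skills lists (illustrative only, not the real data)
skills = [["python", "sql"], ["javascript", "css"], ["python", "css"]]

# Each distinct skill becomes its own 0/1 column
mlb = MultiLabelBinarizer()
binary_skills = pd.DataFrame(mlb.fit_transform(skills), columns=mlb.classes_)
print(binary_skills)
```

Each row now marks with a 1 exactly the skills that developer holds, which is the shape produced by the `binarized()` helper below.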
df = binarized(ex_skills)
df.iloc[:, 1:8]
| profile | skills | .net | .net 4 | .net core | .net framework | ad planningand buying | |
|---|---|---|---|---|---|---|---|
| 0 | Fullstack | [rest, visual studio code, json, full-stack, l... | 0 | 0 | 0 | 0 | 0 |
| 1 | Frontend | [html, ux design, javascript, type script, red... | 0 | 0 | 0 | 0 | 0 |
| 2 | ML | [data science, linux, apache hive, python 3, m... | 0 | 0 | 0 | 0 | 0 |
| 3 | ML | [data science, flask, mysql, python, linux rhe... | 0 | 0 | 0 | 0 | 0 |
| 4 | QA | [javascript, ruby on rails (ror), jenkins] | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 628 | Database | [data science, postgresql, r, aws lambda, opti... | 0 | 0 | 0 | 0 | 0 |
| 629 | Frontend | [html, javascript, php, laravel, wordpress, ph... | 0 | 0 | 0 | 0 | 0 |
| 630 | Frontend | [html, java, spring, javascript, node, js, dat... | 0 | 0 | 0 | 0 | 0 |
| 631 | Frontend | [html, automated testing, phantom js, wirefram... | 0 | 0 | 0 | 0 | 0 |
| 632 | ML | [test-driven development (tdd), amazon web ser... | 0 | 0 | 0 | 0 | 0 |
633 rows × 7 columns
X is recognized as the variable which stores the "input variables", or the features on which the algorithm is trained.
Here features are assigned to X:
# Assign features to X
X = df.iloc[:, 5:]
X
| .net core | .net framework | ad planningand buying | ado.net | adobe illustrator | adobe photoshop | affiliate marketing | afnetworking | agile | ajax | ... | writing | wwf | x | xamarin | xcode | xml | yii | zend framework | zeplin | zurb foundation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 628 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 629 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 630 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 631 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 632 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
633 rows × 525 columns
y is recognized as the variable which stores the class labels in classification problems.
Here class labels are assigned to y:
# Assign target class to y
y = df.profile
y
0 Fullstack
1 Frontend
2 ML
3 ML
4 QA
...
628 Database
629 Frontend
630 Frontend
631 Frontend
632 ML
Name: profile, Length: 633, dtype: object
Feature selection involves choosing the features of the data that are most relevant to the model performing well.
This makes data input to the model less noisy, lets the model train faster, simplifies the model for interpretation, and improves generalization to outside data.
# Create the correlation matrix
corr_matrix = X.corr()
# Subset the matrix
corr_matrix.iloc[:11, :11]
| .net core | .net framework | ad planningand buying | ado.net | adobe illustrator | adobe photoshop | affiliate marketing | afnetworking | agile | ajax | akamai | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| .net core | 1.000000 | -0.001582 | -0.007897 | -0.001582 | -0.001582 | -0.004500 | -0.002745 | -0.001582 | -0.029539 | -0.009329 | -0.003172 |
| .net framework | -0.001582 | 1.000000 | -0.007897 | -0.001582 | -0.001582 | -0.004500 | -0.002745 | -0.001582 | -0.029539 | 0.169613 | -0.003172 |
| ad planningand buying | -0.007897 | -0.007897 | 1.000000 | -0.007897 | -0.007897 | -0.022460 | 0.106738 | -0.007897 | 0.198199 | -0.046556 | 0.088553 |
| ado.net | -0.001582 | -0.001582 | -0.007897 | 1.000000 | -0.001582 | -0.004500 | -0.002745 | -0.001582 | -0.029539 | -0.009329 | -0.003172 |
| adobe illustrator | -0.001582 | -0.001582 | -0.007897 | -0.001582 | 1.000000 | -0.004500 | -0.002745 | -0.001582 | -0.029539 | -0.009329 | -0.003172 |
| adobe photoshop | -0.004500 | -0.004500 | -0.022460 | -0.004500 | -0.004500 | 1.000000 | -0.007807 | -0.004500 | -0.054471 | -0.026533 | -0.009022 |
| affiliate marketing | -0.002745 | -0.002745 | 0.106738 | -0.002745 | -0.002745 | -0.007807 | 1.000000 | -0.002745 | 0.044868 | -0.016183 | -0.005503 |
| afnetworking | -0.001582 | -0.001582 | -0.007897 | -0.001582 | -0.001582 | -0.004500 | -0.002745 | 1.000000 | -0.029539 | -0.009329 | -0.003172 |
| agile | -0.029539 | -0.029539 | 0.198199 | -0.029539 | -0.029539 | -0.054471 | 0.044868 | -0.029539 | 1.000000 | -0.144463 | 0.065734 |
| ajax | -0.009329 | 0.169613 | -0.046556 | -0.009329 | -0.009329 | -0.026533 | -0.016183 | -0.009329 | -0.144463 | 1.000000 | -0.018702 |
| akamai | -0.003172 | -0.003172 | 0.088553 | -0.003172 | -0.003172 | -0.009022 | -0.005503 | -0.003172 | 0.065734 | -0.018702 | 1.000000 |
# Flag the second feature of any pair correlated at 50% or more
columns = np.full((corr_matrix.shape[0],), True, dtype=bool)
for i in range(corr_matrix.shape[0]):
    for j in range(i + 1, corr_matrix.shape[0]):
        if corr_matrix.iloc[i, j] >= 0.5:
            if columns[j]:
                columns[j] = False
# Keep only the unflagged (sufficiently dissimilar) features
X = X.loc[:, X.columns[columns]]
By keeping only features that are less than 50% correlated with one another, the feature count has been reduced from 526 to 278, effectively reducing the "dimensionality" (number of column vectors) of the dataset. This is one of an array of techniques used to select the most important features of a dataset.
# newly selected features
X
| .net core | .net framework | ad planningand buying | ado.net | adobe illustrator | adobe photoshop | affiliate marketing | afnetworking | agile | ajax | ... | web developer | web gl | webpack | windows | windows mobile | windows powershell | windows presentation foundation (wpf) | xamarin | xml | zurb foundation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 628 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 629 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 630 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 631 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 632 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
633 rows × 278 columns
# Performing RFE
rForest = RandomForestClassifier()
rfe = RFE(rForest).fit(X, y)
# Rank of 1 indicates the features selected
rfe.ranking_
array([ 51, 1, 1, 114, 1, 1, 135, 75, 1, 1, 130, 72, 1,
1, 79, 1, 34, 1, 1, 1, 1, 87, 1, 1, 1, 1,
1, 1, 58, 108, 1, 1, 1, 20, 1, 1, 1, 1, 1,
1, 1, 84, 1, 65, 32, 1, 1, 131, 138, 107, 1, 116,
16, 1, 66, 110, 1, 1, 128, 14, 46, 1, 61, 113, 1,
7, 1, 1, 106, 1, 115, 6, 1, 124, 1, 104, 109, 1,
133, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 36, 1,
1, 1, 100, 1, 80, 17, 26, 1, 27, 88, 1, 85, 1,
1, 1, 42, 64, 1, 1, 1, 1, 1, 1, 1, 62, 1,
22, 1, 4, 1, 1, 63, 1, 1, 129, 92, 1, 140, 132,
41, 99, 1, 1, 77, 48, 96, 1, 1, 83, 1, 1, 15,
90, 1, 1, 1, 1, 125, 78, 74, 2, 1, 60, 1, 67,
139, 123, 121, 40, 122, 1, 35, 1, 1, 95, 69, 57, 12,
112, 1, 54, 10, 1, 1, 1, 86, 117, 13, 118, 1, 1,
1, 1, 43, 8, 1, 38, 1, 25, 30, 44, 82, 49, 102,
1, 1, 1, 1, 29, 97, 1, 1, 1, 81, 1, 1, 50,
1, 98, 39, 1, 1, 37, 101, 1, 21, 134, 1, 1, 9,
119, 18, 1, 1, 105, 68, 1, 55, 91, 1, 53, 1, 11,
5, 1, 47, 1, 31, 45, 3, 52, 94, 59, 1, 1, 1,
19, 111, 1, 120, 1, 1, 1, 103, 28, 1, 76, 70, 1,
1, 137, 136, 73, 71, 127, 1, 56, 1, 24, 1, 1, 93,
126, 23, 33, 1, 89])
from IPython.display import IFrame
IFrame(
"https://giphy.com/embed/AOSwwqVjNZlDO",
width="100%",
height="600",
scrolling="no",
frameborder="0",
)
Optimizing Recursive Feature Elimination (RFE):
The number of features returning the best balanced_accuracy score is 130. This is the optimal number of features at which the random forest algorithm performs best!
RFE chooses the features of the data most relevant to the model performing well. Doing so makes the data input to the model less noisy, lets the model train faster, simplifies the model for interpretation, and improves generalization to outside data.
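The search for the optimal feature count isn't shown above; assuming it was done by cross-validating every candidate count, sklearn's RFECV does exactly that. A sketch on a small synthetic stand-in for the skills matrix:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic stand-in for the binarized skills matrix (assumption)
X_demo, y_demo = make_classification(
    n_samples=120, n_features=20, n_informative=5, random_state=0
)

# RFECV cross-validates every feature count and keeps the best-scoring one
selector = RFECV(
    DecisionTreeClassifier(random_state=0),
    step=1,
    cv=3,
    scoring="balanced_accuracy",
).fit(X_demo, y_demo)
print(selector.n_features_)
```

`selector.support_` then gives the boolean mask of retained columns, analogous to the rank-1 entries in `rfe.ranking_` above.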
from yellowbrick.cluster import KElbowVisualizer
# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(1, 130))
visualizer.fit(X_T) # Fit the data to the visualizer
visualizer.show()
<AxesSubplot:title={'center':'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
K-means clustering is actually an unsupervised machine learning algorithm in its own right, but it is often used as an exploratory data analysis technique because it reveals natural clusters in the data. The graph above has grouped the remaining 130 features (or skills) into 16 clusters. Those will be used below to find features of importance.
The silhouette score ranges from −1 to +1, with a high value indicating that an object is well matched to its own cluster and poorly matched to neighboring clusters. A value closer to 0, like ours here, indicates the clusters aren't well separated and share similarity with other clusters. This matches what we saw with feature variance: not many features are responsible for a high amount of variance in the predictive analysis.
kmeans = KMeans(n_clusters=34).fit(X_130)
normalized_vectors = preprocessing.normalize(X_130)
normalized_kmeans = KMeans(n_clusters=34).fit(normalized_vectors)
min_samples = X_130.shape[1] + 1
dbscan = DBSCAN(eps=3.5, min_samples=min_samples).fit(X_130)
# Print results
print(
"kmeans: {0:10.4f}".format(
silhouette_score(X_130, kmeans.labels_, metric="euclidean")
)
)
print(
"Cosine kmeans:{0:10.4f}".format(
silhouette_score(
normalized_vectors, normalized_kmeans.labels_, metric="cosine"
),
)
)
kmeans: 0.0815 Cosine kmeans: 0.1221
The DBSCAN algorithm can be used to find associations and structures in data that are hard to detect manually but that are relevant for finding patterns and predicting trends. Below, we select the ten (10) variables with the highest variance; a high variance indicates that a variable is likely to play an important role in forming the clusters.
results = pd.DataFrame(columns=["Variable", "Var"])
for column in df_mean.columns[1:]:
results.loc[len(results), :] = [column, np.var(df_mean[column])]
selected_columns = (
list(
results.sort_values(
"Var",
ascending=False,
)
.head(10)
.Variable.values
)
+ ["dbscan"]
)
tidy = df_scaled[selected_columns].melt(id_vars="dbscan")
sns.barplot(x="dbscan", y="value", hue="variable", data=tidy)
<AxesSubplot:xlabel='dbscan', ylabel='value'>
for i in [66, *range(89, 97)]:
    print(X_130.columns[i])
go mongodb ms sql server mysql node.js object-oriented programming (oop) opencart opencv oracle
# Set y_ticks
y_axis_labels = list(df.index)
# Normalize the data
df_norm_col = (df - df.mean()) / df.std()
# Plot the data
ax = sns.clustermap(df_norm_col, yticklabels=y_axis_labels)
X_factor = factor_data.iloc[:, :]
y_factor = factor_data.index
# Using Kaiser-Meyer-Olkin (KMO) value to validate adequate sample size
kmo_all, kmo_model = calculate_kmo(X_factor)
print(f"{kmo_model:.4f}")
0.6660
In order to conduct exploratory factor analysis, the Kaiser-Meyer-Olkin (KMO) value must be greater than 0.5, which shows that the sample size is fit for factor analysis; the closer the value is to 1, the better. Although the KMO value calculated above isn't ideal, it indicates that factor analysis is viable. This value could be improved with a bigger sample size, i.e. by collecting more data.
An eigenvalue greater than 1 is the standard criterion for retaining a factor returned by factor analysis, because the eigenvalue represents the amount of variance explained by that factor. An eigenvalue of 4.04, for example, accounts for as much variance as 4.04 of the original variables.
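The scree plot below relies on an `ev` array of eigenvalues produced by the fitted factor-analysis step (not shown). As a self-contained illustration of the eigenvalue-greater-than-1 (Kaiser) criterion, here is a toy example with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: four independent variables plus a near-copy of the first one
base = rng.normal(size=(200, 4))
data = np.column_stack([base, base[:, 0] + 0.1 * rng.normal(size=200)])

# Eigenvalues of the correlation matrix sum to the number of variables;
# the Kaiser criterion retains factors whose eigenvalue exceeds 1
eigenvalues = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
print(eigenvalues.round(2))
```

The redundant pair collapses into one strong factor (eigenvalue near 2) while the near-zero trailing eigenvalue flags the duplicated information.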
# Create scree plot using matplotlib
plt.scatter(range(1, X_factor.shape[1] + 1), ev)
plt.plot(range(1, X_factor.shape[1] + 1), ev)
plt.title("Scree Plot")
plt.xlabel("Factors")
plt.ylabel("Eigenvalue")
plt.grid()
plt.show()
Factor_1 = []
Factor_2 = []
Factor_3 = []
for idx, row in fa_loadings.iterrows():
if row.Fact1_loadings_ > 0.1:
Factor_1.append(row.Original_Factors)
elif row.Fact2_loadings_ > 0.1:
Factor_2.append(row.Original_Factors)
elif row.Fact3_loadings_ > 0.1:
Factor_3.append(row.Original_Factors)
print("Factor_1: ", Factor_1)
print("\nFactor_2: ", Factor_2)
print("\nFactor_3: ", Factor_3)
Factor_1: ['agile', 'android', 'angular js', 'architecture', 'azure', 'brand design', 'chrome apps', 'cordova', 'd3', 'django', 'drupal', 'flask', 'go', 'grunt', 'java', 'linux', 'oracle'] Factor_2: ['adobe photoshop', 'algorithms', 'amazon web services (aws)', 'aws lambda', 'database modeling', 'eclipse', 'front-end', 'full-stack', 'functional programming', 'jenkins', 'json', 'macos', 'mapreduce', 'matlab', 'node.js', 'object-oriented programming (oop)', 'pandas', 'postgresql', 'python 3', 'rest', 'ruby on rails (ror)', 'scikit-learn', 'test-driven development (tdd)', 'testng'] Factor_3: ['apache spark', 'asp.net', 'cassandra', 'data warehouse', 'database design', 'etl', 'microsoft sql server', 't-sql', 'tableau', 'tensorflow']
How can the viability of the returned factors be measured? With the Cronbach alpha, factors can be measured to determine whether or not the variables of a factor form a “coherent” and reliable factor. A value above 0.6 for the alpha is in practice deemed acceptable. Here the Cronbach alpha is calculated using the pingouin package:
# Get cronbach alpha
factor1_alpha = pg.cronbach_alpha(factor1)
factor2_alpha = pg.cronbach_alpha(factor2)
factor3_alpha = pg.cronbach_alpha(factor3)
print(
"factor1_alpha: ",
factor1_alpha,
"\nfactor2_alpha: ",
factor2_alpha,
"\nfactor3_alpha: ",
factor3_alpha,
)
factor1_alpha: (0.8049658949086095, array([0.782, 0.826])) factor2_alpha: (0.4367866900792546, array([0.371, 0.498])) factor3_alpha: (0.47037097717016946, array([0.407, 0.53 ]))
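For intuition, Cronbach's alpha can also be computed directly from the textbook formula, without pingouin. A sketch on made-up data (the latent-trait setup is an assumption, not the notebook's factors):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_samples, n_items) score matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 1))
# Three noisy measurements of one latent trait -> a coherent factor
coherent = latent + 0.5 * rng.normal(size=(300, 3))
# Three unrelated variables -> an incoherent "factor"
incoherent = rng.normal(size=(300, 3))
print(round(cronbach_alpha(coherent), 2), round(cronbach_alpha(incoherent), 2))
```

Variables that genuinely measure the same thing score well above the 0.6 threshold, while unrelated variables hover near 0, mirroring the gap between factor 1 and factors 2 and 3 above.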
fig, ax = plt.subplots(figsize=(16, 16))
data_corr = data.corr()
cmap = sns.mpl_palette("Set3", 4)
# Generate a Heatmap
sns.heatmap(data_corr, cbar_kws={"shrink": 0.8}, square=True, cmap=cmap)
# title
title = "Correlation Between All Skills"
plt.title(title, loc="center", fontsize=24)
Text(0.5, 1.0, 'Correlation Between All Skills')
It's important that classes be balanced: with strong class imbalance, a model can appear to perform well simply by predicting the majority class for every sample, which is not indicative of real model accuracy.
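A trivial majority-class baseline makes the point (a sketch with made-up counts, not this dataset's split):

```python
import numpy as np

# Illustrative imbalanced labels: 90% negative, 10% positive
y_true = np.array([0] * 90 + [1] * 10)
# A degenerate "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()          # looks impressive
minority_recall = (y_pred[y_true == 1] == 1).mean()  # reveals the failure
print(accuracy, minority_recall)
```

The model scores 90% accuracy while never identifying a single positive sample, which is why the class balance below is checked before trusting accuracy.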
# Looking at Class Imbalance
sns.countplot(x="Frontend", data=data, palette="hls")
print(
"Percentage of frontend developers is: ",
"{:.2f}".format(len(data[data["profile"] == "Frontend"]) / len(data) * 100),
"%",
)
print(
"Percentage of non-frontend developers: ",
"{:.2f}".format(
(len(data) - len(data[data["profile"] == "Frontend"])) / len(data) * 100
),
"%",
)
plt.show()
Percentage of frontend developers is: 45.94 % Percentage of non-frontend developers: 54.06 %
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
LogisticRegression()
Accuracy can be a useful measure when each class has a similar number of samples. With an imbalanced dataset, however, accuracy is biased toward the majority class. A more robust scoring metric is the F1 score, used here along with cross_val_score to score the model across multiple test runs.
# Generating predictions by which to measure accuracy of the fitted model
y_pred = logreg.predict(X_test)
print(
"The f1 score of the logistic regression classifier on test set: ",
"{:.3f}".format(
np.mean(cross_val_score(logreg, X_test, y_test, scoring="f1", cv=5))
),
)
The f1 score of the logistic regression classifier on test set: 0.834
This is the basis from which all other metrics are calculated. It provides a simple but clear picture of how the model is performing.
# Confusion Matrix
cm = metrics.confusion_matrix(y_test, y_pred)
group_names = ["True Negative", "False Positive", "False Negative", "True Positive"]
group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
labels = [
f"{v1}\n{v2}\n{v3}"
for v1, v2, v3 in zip(group_names, group_counts, group_percentages)
]
labels = np.asarray(labels).reshape(2, 2)
sns.heatmap(cm, annot=labels, fmt="")
<AxesSubplot:>
This is a commonly used curve that summarizes the performance of a classifier over all possible thresholds. It's generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis) as the threshold for assigning observations to a given class is varied.
# Compute the ROC curve from the predicted scores
false_positive_rate, true_positive_rate, _ = metrics.roc_curve(y_test, y_score)
plt.subplots(1, figsize=(10, 10))
plt.title("Receiver Operating Characteristic (ROC): Logistic Regression")
plt.plot(false_positive_rate, true_positive_rate)
plt.plot([0, 1], ls="--")
plt.plot([0, 0], [1, 0], c=".7"), plt.plot([1, 1], c=".7")
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.show()
In order to reduce the ROC curve to a single number, calculate the Area Under the ROC Curve, or the ROC_AUC score. This tells us how good the model is at ranking predictions.
# The Area Under the Curve (AUC) is the measure of the ability of a classifier to distinguish between classes
print(
"The roc_auc_score for the Logistic Regression classifier: ",
"{:.2f}".format(roc_auc_score(y_test, y_score)),
)
The roc_auc_score for the Logistic Regression classifier: 0.96
sdf = sqlCtx.createDataFrame(data)
# Build a dict (orginal column name: new column name)
mapping = {col: col.replace(".", "_").replace(" ", "_") for col in selected_columns}
# Select all columns and create an alias if there is a mapping for this column
df_renamed = sdf.select(
[F.col("`" + c + "`").alias(mapping.get(c, c)) for c in sdf.columns]
)
# Create a VectorAssembler obj using the renamed columns as input
assembler = VectorAssembler(inputCols=list(mapping.values()), outputCol="features")
transformed_data = assembler.transform(df_renamed)
# Rename "Fullstack" column to "label"
data = transformed_data.withColumnRenamed("Fullstack", "label")
In our dataset (train) we have 36.04% negatives and 63.96% positives. Since positives are in the majority, the logistic loss objective function should treat the negative class (Outcome == 0) with higher weight. For this purpose we calculate the BalancingRatio as follows:
Against every Outcome == 0, the calculated BalancingRatio is assigned to the corresponding row under the “classWeights” column. Likewise with every Outcome == 1, 1-BalancingRatio is assigned to the corresponding row in the “classWeights” column.
data = data.withColumn(
"classWeights",
F.when(data.label == 0, BalancingRatio).otherwise(1 - BalancingRatio),
)
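The formula for BalancingRatio isn't shown above; a common definition (an assumption here) sets it to the positives' share of the dataset, so that the minority negative class receives the larger weight. In plain Python, with counts matching the stated split:

```python
# Illustrative counts approximating the stated 36.04% / 63.96% split
num_negatives, num_positives = 228, 405
dataset_size = num_negatives + num_positives

# Assumed definition: the ratio equals the positives' share, so the
# minority negative class gets the larger weight
balancing_ratio = num_positives / dataset_size

weight_when_label_0 = balancing_ratio       # F.when(data.label == 0, ...)
weight_when_label_1 = 1 - balancing_ratio   # .otherwise(...)
print(round(weight_when_label_0, 2), round(weight_when_label_1, 2))
```

Each misclassified negative then costs roughly 0.64 versus 0.36 for a positive, counteracting the imbalance in the logistic loss.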
from pyspark.ml.classification import LogisticRegression
# Create a logistic regression object
lr = LogisticRegression(
featuresCol="features", labelCol="label", weightCol="classWeights", maxIter=100
)
lrModel = lr.fit(train)
trainingSummary = lrModel.summary
# Plot the precision-recall curve
pr = trainingSummary.pr.toPandas()
plt.plot(pr["precision"], pr["recall"])
plt.xlabel("Precision")
plt.ylabel("Recall / TPR")
plt.show()
# Plot the threshold-precision curve
tp = trainingSummary.precisionByThreshold.toPandas()
plt.plot(tp["threshold"], tp["precision"])
plt.xlabel("Threshold")
plt.ylabel("Precision")
plt.show()
# Generating the DecisionTree model
classifier = DecisionTreeClassifier()
# Fitting the model
classifier.fit(X_train, y_train)
DecisionTreeClassifier()
While the below generated classification report reveals this Binary Decision Tree model to score an accuracy rate of 82%, a Multinomial (or Multi-class) model trained with exactly the same data and with the same model specifications will score an accuracy rate of 68% in predicting the five (5) classes present.
# Making predictions with the model
y_pred = classifier.predict(X_test)
# Generating the classification report
print(classification_report(y_test, y_pred))
precision recall f1-score support
0 0.87 0.84 0.85 140
1 0.81 0.85 0.83 116
accuracy 0.84 256
macro avg 0.84 0.84 0.84 256
weighted avg 0.85 0.84 0.84 256
from yellowbrick.classifier import ROCAUC
# Instantiating the visualizer with the Decision Tree model
visualizer = ROCAUC(classifier, classes=["Frontend", "Non-Frontend"])
visualizer.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer.score(X_test, y_test) # Evaluate the model on the test data
visualizer.show()
<AxesSubplot:title={'center':'ROC Curves for DecisionTreeClassifier'}, xlabel='False Positive Rate', ylabel='True Positive Rate'>
The tree will stabilize at a max_depth of 125 on its own, but for the sake of this visual, the max_depth has been set to 25.
clf = DecisionTreeClassifier(
max_depth=25
) # max_depth is maximum number of levels in the tree
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(14, 14), dpi=600)
clf = clf.fit(X, y)
plot_tree(clf, filled=True, feature_names=feature_names, class_names=class_names)
fig.set_size_inches(3, 6)
Random Forests see high performance with data of high dimensionality. Furthermore, the algorithm handles outliers and unbalanced data well.
# Generate the RandomForest model
forest = RandomForestClassifier(class_weight="balanced", random_state=0)
# Fit the model
forest.fit(X_train, y_train)
RandomForestClassifier(class_weight='balanced', random_state=0)
from yellowbrick.classifier import ClassificationReport
visualizer = ClassificationReport(forest, classes=classes, support=True)
visualizer.fit(X_train, y_train) # Fit the visualizer and the model
visualizer.score(X_test, y_test) # Evaluate the model on the test data
visualizer.show()
<AxesSubplot:title={'center':'RandomForestClassifier Classification Report'}>
from yellowbrick.classifier import ROCAUC
# Instantiating the visualizer with the Random Forest model
visualizer = ROCAUC(forest, classes=classes)
visualizer.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer.score(X_test, y_test) # Evaluate the model on the test data
visualizer.show()
<AxesSubplot:title={'center':'ROC Curves for RandomForestClassifier'}, xlabel='False Positive Rate', ylabel='True Positive Rate'>
The original data's format required some creative re-formatting before it could be correctly processed by our machine learning algorithms. Ultimately, we were able to frame a binary classification problem with Logistic Regression and generate viable results with a Multi-Class Random Forest model. Although these algorithms are powerful in their ability to make sense of data, it's clear that the creativity and insight which only people can provide will continue to be fundamental to the process of machine learning.
In the future, I hope to investigate the information supplied by the predict_proba method in more depth. This method shows the probability that an observation belongs to a certain class, shedding light on which samples the algorithm classifies with less certainty versus more certainty. A closer look at the skills of those profiles may lend insight into which skills are determinative in predicting a profile type.